We explore on education factors that may have impacted minority language cultivation and usage using education statistic provided by World Bank(1). The data contains 3620 variables from year 1976 until 2100 (with some future projection population variables), but we limit our research to historical data up to 2016.

Due to the vast number of features, we first visualized the data in form of bitmap to identify if there’s any gaps (missing values) and to see how data are being structured. The data from World Bank is in stacked form, and the columns represent statistic for each year from 1976 onwards. We unstack the data enabling the variables represented in columns and added a new column named ‘year’. The data covers 242 countries, however we will filter and take only countries from language extinction data set.

fn<-createRasterBitmap(filterByYearUnstack(edstats))

We took a snapshot around the rows of 0.6 ratio which is about in the year 2000 and visualize the data again using interactive plot to estimate the variables that contains data.

edstats.2000<-filterByYearUnstack(edstats, 2000, 2000)
heatmap(edstats.2000)

The Barro-Lee columns set shows more complete data sets, having 2 sparse group of enrolment numbers (bitmap x-axis ratio 0.3) and percentage of students in primary, secondary and tertiary education (bitmap x-axis ratio 0.5). Lastly, the columns with dark filled data (bitmap x-axis 0.7) is mainly country population numbers by age range.

We make further research on Barro-Lee(3) data set and it shows the most complete data set on education statistic, therefore we consider this as our primary data set for education statistics. We take Barro-Lee education statistic from year 2000 up to 2016 and aggregate all the years using mean. The main reason to aggregate from yearly data is to provide a compress view for all the country, and to identify countries that do not have any data.

BarroLee<-which(grepl("Barro-Lee.*", names(edstats.2000)))
edstats.barrolee<-groupbyCountryCorrectCountryCode(filterByYearUnstack(edstats, 2000, 2016) %>%
                                                   select (Country.Code, year, c(BarroLee)))
heatmap(edstats.barrolee, scale=T)

Next, we look into secondary data set based on enrolment data for primary, secondary and tertiary education, together with education expenditure and gross enrolment ratio.

Enrolment.start<-min(which(grepl("Enrolment in early childhood.*", names(edstats.2000))))
Enrolment.end<-max(which(grepl("Gross enrolment ratio.*", names(edstats.2000))))
edstats.enrolment<-groupbyCountryCorrectCountryCode(filterByYearUnstack(edstats, 2000, 2016) %>%
                                                   select (Country.Code, year, c(Enrolment.start:Enrolment.end)))
heatmap(edstats.enrolment, scale = T)

There are still a lot of gaps in the data, therefore we further filter the variables and retained variables that only have missing values for less than 20% of total countries.

na.count<-apply(edstats.enrolment, 2, function(x) length(which(is.na(x))))
edstats.enrolment.new<- edstats.enrolment %>% select(Country.Code, which(na.count<(242*0.2)))
heatmap(edstats.enrolment.new, scale = T)

We finally retain total of 93 from original of 227 variables, and this form our secondary data set showing enrolment and GDP spending on education for each country.

Further work

This is only a preliminary EDA work on the education statistic to retain useful data from original data set. Further analysis on the 2 data sets, (1) Barro-Lee education statistic and (2) enrolment number and GDP spending data, to model differences in each country education level and whether it forms a relationship to language endangerment.

Appendix

require(reshape2)
require(dplyr)
edstats<-read.csv("../input/Edstats_csv/Edstats_Data.csv")

Create a smaller size bitmap compare to heatmap chart to visualize the structure of whole data set.

createRasterBitmap<-function(edstats.df, filename="img/edstats.matrix.bitmap.present.png") {
  dm<-data.matrix(edstats.df %>% select( -year, -Country.Code))
  mat1 <- apply(dm, 2, rev)
  png(filename, width = 5, height = 4, units = 'in', res = 300)
  image(t(mat1), useRaster = T, col=heat.colors(100, alpha = 0.5), 
        xlab="Feature columns 2-3621\n (left-right)", 
        ylab="Country stats 1976-2016(top-bottom)", 
        main="Visualizing data\nfor gaps (missing values)")
  dev.off()
  filename
}

Function to unstack the variables and filter by year:

filterByYearUnstack<-function(edstats.df, fromYear=1976, toYear=2016) {
  getYear<-function(yr) {
    as.numeric(gsub("X([0-9]{4})", "\\1", yr))
  }
  edstats.years<-data.frame()
  for(n in names(edstats)) {
    if (grepl("X[0-9]{4}", n) && getYear(n) >=fromYear && getYear(n) <= toYear)
      edstats.years<-rbind(edstats.years, edstats.df %>% 
                                  dcast(Country.Code~Indicator.Name, value.var= n) %>% 
                                  mutate(year=gsub("X([0-9]{4})", "\\1", n)))
  }
  edstats.years
}

Aggregate function for each variables being group by the country code:

groupbyCountryCorrectCountryCode<-function(edstats.df){
  mean.na <-function(x) mean(x, na.rm = T)
  a<-edstats.df %>% select( - year) %>% group_by(Country.Code) %>% summarize_each(funs(mean.na))
  a$Country.Code<-as.character(a$Country.Code)
  a$Country.Code[which(a$Country.Code=="ROM")] <- "ROU"
  a$Country.Code[which(a$Country.Code=="TMP")] <- "TLS"
  a$Country.Code[which(a$Country.Code=="AGO")] <- "ANG"
  a$Country.Code<-as.factor(a$Country.Code)
  a
}

Show heatmap of data to visualize data for normalized values and missing values (centered and scaled if scale=TRUE).

heatmap<-function(edstats.df, scale=FALSE) {
  require(plotly)
  country.code<-(edstats.df %>% mutate(country.year=as.character(Country.Code)) %>% 
                          select(country.year))$country.year
  df<-edstats.df %>% select(-Country.Code)
  features<-colnames(df)
  dm<-data.matrix(df)
  if (scale)
    dm <- scale(dm, scale=T, center = T)
  
  plot_ly(
    x = features, y = country.code,
    z = dm, type = "heatmap", colorscale = "Greys")
}